Abstract 

The learning curve expresses the error rate of a predictive modeling procedure as a 
function of the sample size of the training dataset. It typically is a decreasing, convex 
function with a positive limiting value. An estimate of the learning curve can be used 
to assess whether a modeling procedure should be expected to become substantially 
more accurate if additional training data become available. This article proposes a 
new procedure for estimating learning curves using imputation. We focus on classification, 
although the idea is applicable to other predictive modeling settings. Simulation 
studies indicate that the learning curve can be estimated with useful accuracy for a 
roughly four-fold increase in the size of the training set relative to the available data, 
and that the proposed imputation approach outperforms an alternative estimation approach 
based on parameterizing the learning curve. We illustrate the method with 
an application that predicts the risk of disease progression for people with chronic 
lymphocytic leukemia. 



1 Introduction 

Predictive models describe the relationship between an outcome and a set of predictor variables, 
and are widely used in areas ranging from personalized medicine to computational 
advertising. For example, in personalized medicine the aim may be to predict which patients 
with a particular disease are likely to respond favorably to a treatment based on 
information contained in a set of pre-treatment biomarkers (Insel 2009). Predictive models 



are developed using a training data set, and their "generalization performance" is typically 
assessed with respect to a test set that is independent of the training set. Minimizing some 
measure of the generalization error rate is usually the first priority in predictive modeling, 
although issues such as model simplicity, interpretability, and ease of implementation may 
also be important. 

One common type of predictive modeling is binary classification, in which the outcome Y 
takes on one of two possible values. In classification problems, the expected misclassification 
rate is a natural measure of generalization performance. Suppose we observe a training set 
of n feature-label pairs $\mathcal{D}_n = \{(X_i, Y_i)\}_{i=1}^n$ (capital letters such as X and Y denote random 
variables and lower case letters such as x and y denote instances of these random variables 
throughout). Using the training data $\mathcal{D}_n$, we can construct a classifier $c_n$, say using logistic 
regression. The goal is to use the classifier $c_n$ to accurately predict the labels Y from the 
observed features X of unlabeled cases. The expected misclassification rate $\tau(n)$ of the 
classifier $c_n$ is the expected proportion of incorrectly labeled features X averaged over both 
the feature-label distribution of test cases and the distribution of $\mathcal{D}_n$; that is, 

$$\tau(n) \triangleq E\,1\{Y \neq c_n(X)\} = E\,E\left[1\{Y \neq c_n(X)\} \mid \mathcal{D}_n\right], \quad (1)$$

where E denotes expectation taken with respect to both (X, Y) and the training data $\mathcal{D}_n$. 
Our primary focus here is to estimate the function $\tau(n)$, which has been termed the "learning 
curve" (Amari et al. 1992, Haussler et al. 1996, Hastie et al. 2009, page 243). 



Knowledge of the learning curve can contribute to both study design and interpretation 
of predictive modeling results. Study design questions arise naturally if the training data are 
acquired in two or more stages, or were obtained from a pilot study with a modest sample 
size. Such a study may produce encouraging evidence that a useful predictive relationship 
exists, but one naturally expects that a predictive model obtained from a small training 
set will not perform as well as one obtained using a larger training set. If we learn that 
the expected generalization performance of a rule obtained by following the study design 
employed in our pilot study can be substantially improved by using a larger training set, 
we would be encouraged to conduct a larger study using the same features and predictive 
modeling approach. 

Learning curves can also contribute to the interpretation of predictive modeling results. 
In many settings in which predictive modeling is applied, the variables naturally fall into 
domains. For example, many early genomic studies investigating risk prediction for cancer 
outcomes used the expression levels of genes associated with cell growth, division, and proliferation 
to inform the predictions. When considering the performance of these early studies, 
which often had modest sample sizes, it was natural to ask whether their performance could 
best be improved by using larger training sets, or by considering additional classes of genes 
such as those involved in resistance to chemotherapeutic agents or inhibition of the immune 
response. A similar question arises when considering the integration of data from different 
domains that may influence disease outcomes, such as environmental influences, measured 
metabolite levels, and inherited genetic factors. 

The rest of the paper is organized as follows. Section 2 reviews some previous work on 
learning curves. Section 3 describes three approaches to learning curve estimation, including 
one existing approach and two new approaches. Section 4 compares the performances of these 
approaches using several simulated examples. Section 5 illustrates the imputation approach 
using a data set in which the goal is to predict the risk that a patient with chronic lymphocytic 
leukemia (CLL) will experience a poor outcome. Section 6 provides some concluding remarks. 

2 Learning curves 

Learning curves have been an object of interest for several decades. Bounds on the learning 
curve follow from the work of Vapnik and Chervonenkis (Vapnik and Chervonenkis 
1971, Vapnik 1982). These bounds have the power law form $a + b/m^{\alpha}$, where $\alpha = 1/2$ 
holds in the most realistic settings. The bounds are tight if nothing is known about the 
distribution of X and one takes a worst-case perspective. If information about the distribution 
of X is available or can be estimated from observed X values, tighter bounds can be 
obtained (e.g. Haussler et al. 1996). 



The problem we consider here is to estimate the learning curve rather than to bound it. 
Thus we consider the setting in which data on (X, Y) are available, on which the estimation 
can be based, and we focus on traditional criteria for statistical estimation such as bias and 
variance, rather than on obtaining bounds. To succeed at this estimation, we must capture 
the general form of the function (e.g. the order at which the function $\tau$ changes with n), but 
also the relevant constants and lower order terms. 

Our focus here is on binary classification, but for comparison we briefly consider the 
setting of linear regression using least-squares methods. In this case, an expression for the 
learning curve $\tau(n)$ can be derived explicitly. The generalization performance in this case 
is naturally assessed using the mean-square prediction error (MSPE) $E[(Y - \hat{Y})^2]$ for a 
prediction $\hat{Y} = \hat{Y}(x)$ of the unobserved Y given its feature vector X = x. For training sets 
of sample size n, the expected MSPE is $\sigma^2(1 + \mathrm{tr}(CM)/n)$, where $M = E[(X'X/n)^{-1}]$ for 
the training design matrix X and $C = E[x^* x^{*\prime}]$ for the test set covariate vector $x^*$. The 
reduction in MSPE due to the use of a larger training set is reflected in the term $\sigma^2 c/n$, 
where $c = \mathrm{tr}(CM)$ captures both the complexity of the model and the similarity of the 
training and testing distributions of covariate vectors. We note that on the more natural 
scale $\mathrm{RMSPE} = [\mathrm{MSPE}]^{1/2}$, this learning curve would have the form $a + b/n^{1/2}$. 
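
As a small worked instance of the expression $\sigma^2(1 + \mathrm{tr}(CM)/n)$, the sketch below evaluates it in one special case that is our own assumption, not a setting used in the paper: training and test covariates are both standard normal in p dimensions and the model has no intercept. In that case $C = I_p$ and, by the standard formula for the mean of an inverse Wishart matrix, $E[(X'X)^{-1}] = I_p/(n - p - 1)$ for $n > p + 1$. The function name `expected_mspe` is hypothetical.

```python
def expected_mspe(n, p, sigma2=1.0):
    """Evaluate sigma^2 * (1 + tr(CM)/n) when x ~ N(0, I_p) (an assumed special case).

    With C = I_p and M = E[(X'X/n)^{-1}] = n / (n - p - 1) * I_p,
    tr(CM) / n simplifies to p / (n - p - 1).
    """
    if n <= p + 1:
        raise ValueError("need n > p + 1 for E[(X'X)^{-1}] to exist")
    trace_cm = p * n / (n - p - 1)
    return sigma2 * (1.0 + trace_cm / n)


# The excess over sigma^2 shrinks as the training set grows.
for n in (50, 100, 200):
    print(n, expected_mspe(n, p=10))
```

In this special case $\mathrm{tr}(CM)$ approaches p as n grows, which matches the reading of c as a measure of model complexity.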

3 Approaches to learning curve estimation 

In this section, we describe three approaches to learning curve estimation. The first approach 



follows a proposal of Mukherjee et al. (2003). The second and third approaches are new, to 
our knowledge. 

3.1 Estimating the learning curve via subsampling and extrapolation 

In 2003, Mukherjee et al. described an approach to learning curve estimation based on 
parameterizing the learning curve. We are not aware that a name has been given to this 
method, and therefore we termed it "SUBEX" for "subsampling and extrapolation." The 
method parameterizes the learning curve as an inverse power law of the form $\tau(m) = a + b m^{-\alpha}$. 
As noted above, this expression is exact in the case of linear least squares regression, 
but may be inexact in the case of classification using logistic regression. The unknown 
parameters in this expression are $a \in \mathbb{R}$ and $b, \alpha > 0$. This parametric form is fit by first 
using cross-validation on subsamples of the data of various sizes $m' < m$ to obtain direct 
estimates of $\tau(m')$. Specifically, for a given $m' < m$, we can subsample B subsets of the 
training data of size $m'$, fit the classification model to each subset, and use the complementary 
$m - m'$ samples to unbiasedly estimate the error rate. These B error rate estimates can be 
averaged to estimate $\tau(m')$. The parametric form for $\tau(m)$ is then fit to these values to 
estimate a, b, and $\alpha$ using some form of nonlinear regression. For example, nonlinear least 
squares would estimate a, b and $\alpha$ by minimizing 

$$\sum_k \left(\hat{\tau}(m_k) - a - b\,m_k^{-\alpha}\right)^2,$$

where the $m_k < m$ are a set of sample sizes on which the error rate is directly estimated. 
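
The curve-fitting step can be sketched as follows. This is a minimal illustration, not the fitting routine of Mukherjee et al.: it assumes the subsample error estimates are already available, and it swaps in a grid search over the exponent combined with ordinary least squares for the two remaining parameters. The function name `fit_inverse_power_law`, the exponent grid, and the handling of a negative slope are all our assumptions.

```python
import numpy as np


def fit_inverse_power_law(sizes, errors, alphas=np.linspace(0.05, 2.0, 40)):
    """Least-squares fit of errors ~ a + b * sizes**(-alpha) with b >= 0.

    For each candidate alpha the model is linear in (a, b), so (a, b) is
    obtained by ordinary least squares; the alpha with the smallest
    residual sum of squares is kept.
    """
    sizes = np.asarray(sizes, dtype=float)
    errors = np.asarray(errors, dtype=float)
    best = None
    for alpha in alphas:
        z = sizes ** (-alpha)
        design = np.column_stack([np.ones_like(z), z])
        (a, b), *_ = np.linalg.lstsq(design, errors, rcond=None)
        if b < 0:
            # Constraint b >= 0 is active: the best constant fit is the mean.
            a, b = errors.mean(), 0.0
        sse = float(np.sum((errors - a - b * z) ** 2))
        if best is None or sse < best[0]:
            best = (sse, float(a), float(b), float(alpha))
    return best[1], best[2], best[3]


# Hypothetical error estimates at four subsample sizes.
a, b, alpha = fit_inverse_power_law([10, 20, 30, 40], [0.31, 0.26, 0.24, 0.23])
print(a + b * 200 ** (-alpha))  # extrapolated error rate at m = 200
```

The case where the slope constraint is active corresponds to a flat fitted curve, which is the situation the discussion of asymmetric constraints refers to.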

As shown in Section 4, the SUBEX estimator can overstate the reduction in error rate, conveying an overly 
optimistic assessment of the generalization performance. One reason for this optimism is the 
asymmetry in the constraints placed on r. The curve is constrained to be non-increasing, 
which is quite natural, but owing to the high variance in cross-validation estimators, the 
constraint is active on a non-negligible proportion of modest-sized training sets under simple 
generative models. By contrast, aside from being constrained to be non-negative, the curve 
is unrestricted in how rapidly it can decrease. Thus, when averaged over training sets, a 
negative bias in the learning curve results. 

3.2 Estimating the learning curve via imputation and interpolation 

The second approach we consider for estimating the learning curve uses data imputation and 
interpolation, hence this approach is termed "IMPINT." In this approach, one first estimates 
the joint distribution of the feature-label pair (X, Y). This estimation is generally performed 
by separately estimating the feature distribution $p_X$ and the conditional distribution $p_{Y|X}$ of 
the label Y given the feature X. After estimating the joint distribution, one can synthesize 
data sets of any size. The ability to simulate such data sets allows direct estimation of any 
point on the learning curve. Specifically, one can generate an arbitrary number of training 
sets of a given size m from the estimated $p_X p_{Y|X}$, build a classifier on each one, and then average the 
generalization performance over newly drawn feature-label pairs from the estimated $p_X p_{Y|X}$. The complete 
learning curve can be obtained via interpolation among the learning curve points that are 
directly estimated. 



We now describe in detail the IMPINT procedure based on logistic regression. Suppose we 
observe a training set $\mathcal{D}_n$ of feature-label pairs $\{(X_i, Y_i)\}_{i=1}^n$ drawn i.i.d. from an unknown 
distribution with density $p_{X,Y}(x, y)$. The features X take values in $\mathbb{R}^p$, and the binary 
labels Y are coded to take values in {0,1}. Define $\pi(x; \beta) = \mathrm{logit}^{-1}(x^T\beta) = 1/(1 + e^{-x^T\beta})$. 
The distribution $p_{X,Y}(x, y)$ factors into the product $p_{Y|X}(y|x)\,p_X(x)$. Under the logistic 
regression model the conditional distribution of Y given X has the form 
$p_{Y|X}(y|x) = \pi(x; \beta^*)^{y}\,(1 - \pi(x; \beta^*))^{1-y}$, with $\beta^* \in \mathbb{R}^p$ denoting the unknown true parameter value. The marginal 
distribution of the features X, denoted by $p_X(x)$, is left unspecified for now. 

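The conditional error rate for a particular training set $\mathcal{D}_m$ is 

$$R(\mathcal{D}_m; \beta^*) \triangleq \int \left[\pi(x; \beta^*)\,1\{x^T\hat\beta_m \le \kappa\} + (1 - \pi(x; \beta^*))\,1\{x^T\hat\beta_m > \kappa\}\right] p_X(x)\,dx, \quad (2)$$

where $\hat\beta_m = \hat\beta_m(\mathcal{D}_m)$ is the maximum likelihood estimator of $\beta^*$ and the classification rule 
is given by $c_m(x) = 1\{x^T\hat\beta_m > \kappa\}$ for some threshold $\kappa \in \mathbb{R}$. Using (1), we can express the 
learning curve as 

$$\tau(m) = \int R(\mathcal{D}_m; \beta^*) \prod_{i=1}^{m} p_X(x_i)\,p_{Y|X}(y_i|x_i)\,dx_i\,dy_i. \quad (3)$$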

If $\beta^*$ and $p_X(x)$ were both known, one could compute $\tau(m)$ for each m using (3), for 
example, using Monte Carlo methods to approximate the (2m + 1)-dimensional integral with 
arbitrary accuracy. More specifically, using $\beta^*$ and $p_X(x)$ one could generate B training sets 
$\mathcal{D}_m^{(1)}, \mathcal{D}_m^{(2)}, \ldots, \mathcal{D}_m^{(B)}$, each of size m. Fitting a logistic regression model on the $b$-th training 
set yields the estimator $\hat\beta_m^{(b)}$ of $\beta^*$. Furthermore, one could use $\beta^*$ and $p_X(x)$ to generate a 
large test set $\mathcal{D}^*$ of size N. For sufficiently large B and N, it follows that 

$$\tau(m) \approx \frac{1}{BN} \sum_{b=1}^{B} \sum_{(X,Y) \in \mathcal{D}^*} \left[Y\,1\{X^T\hat\beta_m^{(b)} \le \kappa\} + (1 - Y)\,1\{X^T\hat\beta_m^{(b)} > \kappa\}\right]. \quad (4)$$

The IMPINT estimate of $\tau(m)$, which we denote by $\hat\tau_I(m)$, is formed by applying the 
approximation given in (4) over imputed training and testing sets generated using an imputation 
model fit to the complete observed training data $\mathcal{D}_n$. More specifically, let $\hat p_X(x)$ 
denote an estimator of $p_X(x)$ and $\hat\beta_n$ denote the usual maximum likelihood estimator of 
$\beta^*$. Note that labeled data is not needed to estimate $p_X(x)$, hence if additional unlabeled 
features are available they can be used to improve the estimation of $p_X(x)$. The estimators 
$\hat p_X(x)$ and $\hat\beta_n$, substituted for $p_X(x)$ and $\beta^*$ respectively, can be used to impute B training 
sets $\hat{\mathcal{D}}_m^{(1)}, \hat{\mathcal{D}}_m^{(2)}, \ldots, \hat{\mathcal{D}}_m^{(B)}$, each of size m, and to sample a large synthetic test set $\hat{\mathcal{D}}^*$. Using 
(4), the IMPINT estimator is given by 

$$\hat\tau_I(m) \triangleq \frac{1}{BN} \sum_{b=1}^{B} \sum_{(X,Y) \in \hat{\mathcal{D}}^*} \left[Y\,1\{X^T\hat\beta_m^{(b)} \le \kappa\} + (1 - Y)\,1\{X^T\hat\beta_m^{(b)} > \kappa\}\right], \quad (5)$$

where $\hat\beta_m^{(b)} = \hat\beta_m(\hat{\mathcal{D}}_m^{(b)})$ denotes the maximum likelihood estimator based on the $b$-th imputed 
training set $\hat{\mathcal{D}}_m^{(b)}$. 
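
The imputation steps described above can be sketched in a few lines. This is a minimal illustration under assumptions of our own, not the authors' implementation: the feature distribution is modeled as multivariate normal with the sample mean and covariance, the model has no separate intercept, and a small ridge penalty is added to the logistic fit purely as a numerical safeguard against separable imputed data. The names `sigmoid`, `fit_logistic`, and `impint` are hypothetical.

```python
import numpy as np


def sigmoid(z):
    # Clipping avoids overflow in exp for extreme linear predictors.
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30.0, 30.0)))


def fit_logistic(X, y, ridge=1e-3, iters=25):
    """Newton-Raphson logistic fit; the ridge term is an added safeguard."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = sigmoid(X @ beta)
        grad = X.T @ (y - p) - ridge * beta
        hess = (X * (p * (1 - p))[:, None]).T @ X + ridge * np.eye(X.shape[1])
        beta = beta + np.linalg.solve(hess, grad)
    return beta


def impint(X, y, m, B=200, N=2000, kappa=0.0, seed=0):
    """Estimate the expected error rate at training size m by imputation."""
    rng = np.random.default_rng(seed)
    mu, cov = X.mean(axis=0), np.cov(X, rowvar=False)  # normal model for p_X
    beta_n = fit_logistic(X, y)                        # fitted p_{Y|X}

    def draw(size):
        Xs = rng.multivariate_normal(mu, cov, size=size)
        return Xs, (rng.random(size) < sigmoid(Xs @ beta_n)).astype(float)

    X_test, y_test = draw(N)                           # synthetic test set
    errors = []
    for _ in range(B):
        Xb, yb = draw(m)                               # imputed training set
        pred = (X_test @ fit_logistic(Xb, yb) > kappa).astype(float)
        errors.append(np.mean(pred != y_test))
    return float(np.mean(errors))
```

Calling `impint(X, y, m)` for several values of m gives the directly estimated points of the curve, between which one can interpolate.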

The model $\hat p_X$ for $p_X$ can be obtained using any appropriate modeling approach for 
multivariate data. Some possible approaches are demonstrated in the simulation studies 
and real data analysis below. To avoid model mis-specification, it is tempting to simply 
use the empirical distribution function of X in place of $\hat p_X$. However, in our experience, 
this approach does not work well with continuous covariates. In particular, the estimated 
learning curve tends to be substantially biased downward. This may occur because points in 
the training set have positive mass in the testing population. Thus, the model is not relied 
upon to interpolate probabilities between observed X points; see Efron (1983) for a discussion 
of the role played by the distance between training and testing sets in classification. 

We found that the number of imputed data sets required to ensure that the IMPINT 
estimate is smoothly non-increasing can be relatively large. As a practical matter, it is more 
computationally efficient to use a smaller value of B (e.g. B between 500 and 1000), and then feed 
the resulting estimate through a monotone smoother (e.g., Friedman and Tibshirani 1984; 
alternatively a parametric model could be fit as in Mukherjee et al. 2003). We found that 
the use of a monotone smoother reduces variance without introducing detectable additional 
bias. 
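
To illustrate the smoothing step, the sketch below uses the pool-adjacent-violators algorithm, a standard least-squares monotone fit. We swap it in here for simplicity; it is not the Friedman and Tibshirani smoother cited above, and the function name `monotone_nonincreasing` and the example values are hypothetical.

```python
def monotone_nonincreasing(values):
    """Least-squares non-increasing fit via pool-adjacent-violators.

    `values` are error-rate estimates ordered by increasing training size.
    """
    blocks = []  # each block holds [sum of values, number of values]
    for v in values:
        blocks.append([float(v), 1])
        # Merge while an earlier block mean is below the next block mean.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] < blocks[-1][0] / blocks[-1][1]):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fitted = []
    for total, count in blocks:
        fitted.extend([total / count] * count)
    return fitted


# The small upward blip at the third point is averaged with its neighbor.
print(monotone_nonincreasing([0.30, 0.25, 0.27, 0.20]))
```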

3.3 Bias reduction for learning curve estimates 

We found that a simple bias reduction substantially improves the performance of the learning 
curve estimates obtained from imputed data. This leads to a modified IMPINT approach, which 
we call BRIE for "Bias Reduced Imputation Estimator." To motivate this approach, consider 
what happens when we estimate $p_{Y|X}$ using the best-fitting regression model (e.g., a fitted 
logistic regression model). This will overstate the strength of the relationship between X and 
Y, particularly when the true relationship is weak (e.g., if Y and X are independent, then 
the fitted $p_{Y|X}$ will still exhibit a relationship). Thus, the IMPINT estimator tends to be an optimistic 
estimator of $\tau(m)$, in the sense that it systematically overstates predictive accuracy. A simple 
bias correction addresses this problem. 

We observed empirically (see Section 4) that for any positive integer k, the estimator 
$\hat\tau_I(m) - \hat\tau_I(m + k)$ exhibits little bias as an estimator of $\tau(m) - \tau(m + k)$. That is, 
the IMPINT estimator is nearly unbiased for the increments of the learning curve but not 
necessarily for its overall level. An explanation for this observation parallels the intuition 
behind the bootstrap as follows. The asymptote of the learning curve $\lim_{m \to \infty} \tau(m)$ is the 
Bayes error rate and thus depends exclusively on $p_X(x)$ and $\beta^*$. However, the increments 
of $\tau$, in addition to depending on $\beta^*$ and $p_X(x)$, depend on the sampling properties of 
the estimator $\hat\beta_m$ as well as the convergence behavior of $\hat\beta_m$ to its limiting value $\beta^*$. The 
increments of the IMPINT learning curve estimator $\hat\tau_I$ are determined by the sampling 
properties of $\hat\beta_m^{(b)}$ as an estimator of $\hat\beta_n$ as well as the manner by which $\hat\beta_m^{(b)}$ approaches 
its limiting value. Thus, if the sampling properties of $\hat\beta_m^{(b)}$ about $\hat\beta_n$ accurately reflect the 
sampling properties of $\hat\beta_n$ about $\beta^*$ and if $\hat p_X(x)$ is a reasonable estimator of $p_X(x)$, then it 
is possible that the increments of $\hat\tau_I$ approximate the increments of $\tau$. 

If one can accurately estimate the increments of the learning curve, all that remains 
is to find an unbiased estimator of the learning curve at a single training set size. This is 
provided by the leave-one-out cross-validation (LOOCV) estimator of the expected test error 
based on the complete observed training set $\mathcal{D}_n$, which provides an unbiased estimator of 
$\tau(n - 1)$. Let $\hat\tau_{CV}(n - 1)$ denote the LOOCV estimator of $\tau(n - 1)$. The BRIE is then given 
by $\hat\tau_B(m) = \hat\tau_I(m) + (\hat\tau_{CV}(n - 1) - \hat\tau_I(n - 1))$. Thus, BRIE is simply a shifted version of 
the IMPINT estimator. 
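
The shift can be written out directly. The sketch below assumes the IMPINT estimates are held in a dictionary keyed by training set size and that it includes an entry at n - 1; the function name `brie` and the numbers in the example are hypothetical and chosen only for illustration.

```python
def brie(impint_curve, loocv_error, n):
    """Shift an IMPINT curve so that it matches the LOOCV estimate at n - 1.

    impint_curve: dict mapping training size m to the IMPINT estimate at m
                  (must contain the key n - 1).
    loocv_error:  leave-one-out cross-validation error on the n observed cases.
    """
    shift = loocv_error - impint_curve[n - 1]
    return {m: estimate + shift for m, estimate in impint_curve.items()}


# Hypothetical IMPINT estimates from an observed training set of size n = 50.
curve = {49: 0.20, 75: 0.17, 100: 0.15}
print(brie(curve, loocv_error=0.24, n=50))
```

Because every point is moved by the same amount, differences between points of the curve are unchanged by the correction.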

The choice to use the LOOCV estimator of the misclassification rate is not essential. 
Noting that $\tau(n) \approx \tau(n - 1)$, one could employ any unbiased (or nearly unbiased) estimator 
of the expected test error $\tau(n)$ to recenter the uncorrected estimator of the learning curve. 
There are also other ways to achieve this bias correction. For instance, the MLE $\hat\beta_n$ of 
$\beta^*$ derived from the training set may be rescaled by a factor c < 1, producing a shrunken 
coefficient vector $c\hat\beta_n$. We found that this approach gives similar results to those obtained 
using the simple additive bias correction. 

4 Empirical studies 

In this section we examine the performance of the SUBEX, IMPINT, and BRIE procedures 
in terms of bias and variance, using a series of simulation studies. For each example we used 
1000 Monte Carlo iterations, B = 1000 imputed data sets, and an imputed test set $\hat{\mathcal{D}}^*$ of size 
5000. All initial training sets are of size n = 50 and we estimate $\tau(m)$ for m = 75, 100, 150, 
and 200 using this initial training set. Thus, we are attempting to extrapolate substantially 
beyond the initial training set size. 

For our simulation studies, we consider the following class of models. The distribution of 
the features X is a p-variate normal distribution with isotropic variance-covariance matrix $\Sigma$ 
given by $\Sigma_{ij} = r^{|i-j|}$. The true parameter vector $\beta^*$ is a p-vector of ones. Thus, this class of 
models is determined by the dimension of the model p and the level of dependence between 
the features, which is governed by the parameter $r \in (-1, 1)$. 
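
A draw from this class of models can be sketched as follows. The text does not state the mean of the feature distribution, so a zero mean vector is assumed here, and labels are drawn from the logistic model with the vector of ones as the true parameter. The function name `simulate` is hypothetical.

```python
import numpy as np


def simulate(n, p, r, seed=0):
    """Draw n feature-label pairs with Sigma_ij = r**|i-j| and beta* = (1, ..., 1)."""
    rng = np.random.default_rng(seed)
    idx = np.arange(p)
    sigma = r ** np.abs(idx[:, None] - idx[None, :])
    X = rng.multivariate_normal(np.zeros(p), sigma, size=n)  # zero mean assumed
    prob = 1.0 / (1.0 + np.exp(-X @ np.ones(p)))
    y = (rng.random(n) < prob).astype(int)
    return X, y, sigma


# One training set from the p = 15, r = 0.25 model.
X, y, sigma = simulate(n=50, p=15, r=0.25)
print(X.shape, y.mean())
```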

Figure 1 shows a few examples of learning curves generated using this class of models. 
The left hand side of Figure 1 shows the learning curve $\tau(m)$ based on training sets of size 
n = 50, for p fixed at 15 and r = 0.0, 0.25, 0.50, and 0.75. The figure shows that as r 
increases, both the learning rate and the asymptotic error rate (i.e. the Bayes error rate) 
decrease (differences in the learning rate are most evident for sample sizes less than 100). 
These changes are mostly driven by the fact that as the positive correlation r increases, 
the distribution of $|X^T\beta^*|$ becomes stochastically larger, and thus fewer points lie near the 
optimal decision boundary $\{x \in \mathbb{R}^p : x^T\beta^* = 0\}$. The right hand side of Figure 1 shows the 
learning curve $\tau(m)$ based on training sets of size n = 50 for r fixed at 0.10 and p = 10, 15, 20, 
and 25. The figure shows, as would be expected, that the learning curve becomes steeper as 
the dimension of the problem increases (the raw values of $\tau$ are difficult to compare across 
values of p since the Bayes error rate changes with p). We use these examples to examine 
bias and variance properties of the BRIE and SUBEX estimators. 

Figure 1: Left: Learning curves $\tau(m)$ for training sets of size n = 50, for the isotropic 
normal model with p = 15 and r = 0, .25, .5, .75. Right: Learning curves $\tau(m)$ for training 
sets of size n = 50 for the isotropic normal model with r = .10 and p = 10, 15, 20, 25. 

4.1 Task one: estimating the improvement in expected error rate 

We first consider the task of estimating the improvement in expected error rate if additional 
training data are obtained. That is, our goal is to estimate $\delta(n, m) = \tau(n) - \tau(m)$ for m > n. 
Note that the plug-in BRIE estimate of $\delta(n, m)$, $\hat\delta_B(n, m) = \hat\tau_B(n) - \hat\tau_B(m)$, is equal to the 
plug-in IMPINT estimate of $\delta(n, m)$. Estimation of $\delta(n, m)$ is a somewhat easier problem 
than estimating the entire learning curve $\tau(m)$, as it only requires estimating the shape but 
not the absolute level of the learning curve. This quantity may be of interest to a researcher 
who cares more about relative improvement, e.g. a 5% reduction in expected error rate, than 
absolute improvement in error rate, e.g. a reduction from 12% to 7%. 
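
The equality of the BRIE and IMPINT plug-in estimates of the improvement follows from the additive form of the bias correction. Writing s for the shift that BRIE applies to the IMPINT curve (the symbol s is our notation), a one-line check is:

```latex
\begin{aligned}
s &= \hat\tau_{CV}(n-1) - \hat\tau_I(n-1), \\
\hat\delta_B(n,m) &= \hat\tau_B(n) - \hat\tau_B(m)
  = \bigl(\hat\tau_I(n) + s\bigr) - \bigl(\hat\tau_I(m) + s\bigr)
  = \hat\tau_I(n) - \hat\tau_I(m).
\end{aligned}
```

The shift cancels, so the variability of the cross-validation estimator does not enter the estimate of the improvement.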

Tables 1 and 2 show the estimated expected values and standard deviations for the BRIE 
(IMPINT) and SUBEX estimators, on a class of eight models as described at the beginning 
of this section. The BRIE exhibits substantially smaller bias and standard deviation than 
the SUBEX estimator in all instances, and provides useful estimates even when extrapolating 
from n = 50 to m = 200, a four-fold increase in sample size. The lower variability of 
BRIE relative to SUBEX presumably reflects the variance that SUBEX inherits from the 
LOOCV estimator $\hat\tau_{CV}(n - 1)$, which is well known to be highly variable (Toussaint 
1974; Efron 1983; Snapinn and Knoke 1985; Breiman and Spector 1992). The bias in 
the SUBEX estimator could be due to the true value of $\tau$ not following an inverse power 
law exactly, or could be due to the asymmetric constraints imposed on the fitted curve, as 
discussed above. 

In Tables 1 and 2, the marginal distribution $p_X(x)$ was estimated using maximum likelihood 
for a normal model with unknown mean vector $\mu$ and unknown variance-covariance 
matrix $\Sigma$ given by $\Sigma_{ij} = \sigma^2\rho^{|i-j|}$. Thus, this simulation provides the BRIE estimator the 
advantage of knowing the form of the covariance matrix. 

Tables 3 and 4 show the analogous results when px{x) is modeled as a multivariate normal 
distribution with unconstrained variance-covariance matrix. The usual plug-in estimator of 
the covariance is used. The tables show that the BRIE estimator still exhibits substantially 
smaller bias and variability than the SUBEX estimator. However, the standard deviation of 
the BRIE is on average about twice as large as when the constrained covariance 
estimate is used, as in Tables 1 and 2. 

4.2 Task two: estimating the learning curve 

Next we consider the more difficult task of estimating the full learning curve, not just its 
increments. Our goal is to estimate $\tau(75)$, $\tau(100)$, $\tau(150)$, and $\tau(200)$ using a training set 
size of n = 50. As in the previous section, we use the SUBEX estimator as a baseline for 
comparison. 

Tables 5 and 6 show the estimated expected values and standard deviations for the 
BRIE and the SUBEX estimator of the learning curve on the same eight models considered 
in the preceding section. Like the results for estimating the improvement in error rate, 
the BRIE shows only negligible bias for estimating the learning curve, while the bias for 

p    r      $\delta(50,75)$    $\hat\delta_B(50,75)$, SD    $\hat\delta_S(50,75)$, SD    $\delta(50,100)$   $\hat\delta_B(50,100)$, SD   $\hat\delta_S(50,100)$, SD
10   0.10   0.0173             0.0145, 0.00477              0.0254, 0.0296               0.0249             0.0209, 0.00613              0.0415, 0.0486
15   0.10   0.0232             0.0253, 0.00362              0.0297, 0.0323               0.0362             0.0376, 0.00461              0.0484, 0.0531
20   0.10   0.0266             0.0327, 0.00382              0.0383, 0.0397               0.0415             0.0497, 0.00499              0.0625, 0.0650
25   0.10   0.0294             0.0328, 0.00347              0.0400, 0.0384               0.0475             0.0498, 0.00450              0.0653, 0.0630
15   0.00   0.0210             0.0263, 0.00344              0.0296, 0.0321               0.0372             0.0390, 0.00440              0.0483, 0.0527
15   0.25   0.0222             0.0226, 0.00375              0.0272, 0.0317               0.0300             0.0340, 0.00478              0.0444, 0.0520
15   0.50   0.0100             0.0169, 0.00462              0.0235, 0.0287               0.0180             0.0259, 0.00615              0.0383, 0.0471
15   0.75   0.00590            0.00986, 0.00478             0.0147, 0.0166               0.00824            0.0151, 0.00651              0.0238, 0.0271

Table 1: Mean and standard deviation of the BRIE and SUBEX estimators for estimating 
the improvement in the expected error rate when the training set size is increased from n = 50 
to m = 75 and m = 100. The BRIE estimator used a multivariate normal model with 
the restriction that $\Sigma_{ij} = \sigma^2\rho^{|i-j|}$ to estimate $p_X(x)$. The BRIE estimator is seen to be 
significantly less biased than the SUBEX estimator, and to have a smaller standard 
deviation across training sets. 



p    r      $\delta(50,150)$   $\hat\delta_B(50,150)$, SD   $\hat\delta_S(50,150)$, SD   $\delta(50,200)$   $\hat\delta_B(50,200)$, SD   $\hat\delta_S(50,200)$, SD
10   0.10   0.0322             0.0268, 0.00679              0.0616, 0.0727               0.0388             0.0285, 0.00686              0.0742, 0.0880
15   0.10   0.0484             0.0497, 0.00496              0.0720, 0.0794               0.0563             0.0556, 0.00490              0.0870, 0.0963
20   0.10   0.0614             0.0673, 0.00523              0.0930, 0.0971               0.0699             0.0765, 0.00541              0.112, 0.117
25   0.10   0.0684             0.0674, 0.00484              0.0973, 0.0943               0.0823             0.0767, 0.00469              0.117, 0.124
15   0.00   0.0487             0.0512, 0.00478              0.0719, 0.0789               0.0576             0.0571, 0.00476              0.0868, 0.0956
15   0.25   0.0459             0.0456, 0.00506              0.0660, 0.0778               0.0525             0.0515, 0.00487              0.0797, 0.0943
15   0.50   0.0313             0.0356, 0.00684              0.0570, 0.0705               0.0357             0.0410, 0.00673              0.0687, 0.0855
15   0.75   0.0150             0.0210, 0.00763              0.0351, 0.0405               0.0189             0.0244, 0.00791              0.0421, 0.0490

Table 2: Mean and standard deviation of the BRIE and SUBEX estimators for estimating 
the improvement in the expected error rate when the training set size is increased from n = 50 
to m = 150 and m = 200. The BRIE estimator used a multivariate normal model with 
the restriction that $\Sigma_{ij} = \sigma^2\rho^{|i-j|}$ to estimate $p_X(x)$. The BRIE estimator is seen to be 
significantly less biased than the SUBEX estimator, and to have a smaller standard 
deviation across training sets. 

p    r      $\delta(50,75)$    $\hat\delta_B(50,75)$, SD    $\hat\delta_S(50,75)$, SD    $\delta(50,100)$   $\hat\delta_B(50,100)$, SD   $\hat\delta_S(50,100)$, SD
10   0.10   0.0173             0.0100, 0.00206              0.0254, 0.0296               0.0249             0.0140, 0.00330              0.0415, 0.0486
15   0.10   0.0232             0.0209, 0.00670              0.0297, 0.0323               0.0362             0.0314, 0.00865              0.0484, 0.0531
20   0.10   0.0266             0.0240, 0.00372              0.0383, 0.0397               0.0415             0.0372, 0.00519              0.0625, 0.0650
25   0.10   0.0294             0.0256, 0.00748              0.0400, 0.0384               0.0475             0.0404, 0.0102               0.0653, 0.0630
15   0.00   0.0210             0.0217, 0.00704              0.0296, 0.0321               0.0372             0.0325, 0.00910              0.0483, 0.0527
15   0.25   0.0222             0.0184, 0.00620              0.0272, 0.0317               0.0300             0.0280, 0.00806              0.0444, 0.0520
15   0.50   0.0100             0.0140, 0.00568              0.0235, 0.0287               0.0180             0.0215, 0.00750              0.0383, 0.0471
15   0.75   0.00590            0.00879, 0.00413             0.0147, 0.0166               0.00824            0.0135, 0.00554              0.0238, 0.0271

Table 3: Mean and standard deviation of the BRIE and SUBEX estimators for estimating 
the improvement in the expected error rate when the training set size is increased from n = 50 
to m = 75 and m = 100. The BRIE estimator used an unrestricted multivariate normal 
model to estimate $p_X(x)$. The BRIE estimator is seen to be significantly less biased than 
the SUBEX estimator, and to have a smaller standard deviation across training sets. 



p    r      $\delta(50,150)$   $\hat\delta_B(50,150)$, SD   $\hat\delta_S(50,150)$, SD   $\delta(50,200)$   $\hat\delta_B(50,200)$, SD   $\hat\delta_S(50,200)$, SD
10   0.10   0.0322             0.0202, 0.00476              0.0616, 0.0727               0.0388             0.0240, 0.00563              0.0742, 0.0880
15   0.10   0.0484             0.0421, 0.00939              0.0720, 0.0794               0.0563             0.0477, 0.00918              0.0870, 0.0963
20   0.10   0.0614             0.0519, 0.0101               0.0930, 0.0971               0.0699             0.0603, 0.00998              0.112, 0.117
25   0.10   0.0684             0.0578, 0.0118               0.0973, 0.0943               0.0823             0.0682, 0.0118               0.117, 0.124
15   0.00   0.0487             0.0435, 0.00991              0.0719, 0.0789               0.0576             0.0492, 0.00974              0.0868, 0.0956
15   0.25   0.0459             0.0381, 0.00878              0.0660, 0.0778               0.0525             0.0436, 0.00857              0.0797, 0.0943
15   0.50   0.0313             0.0298, 0.00839              0.0570, 0.0705               0.0357             0.0345, 0.00836              0.0687, 0.0855
15   0.75   0.0150             0.0188, 0.00636              0.0351, 0.0405               0.0189             0.0218, 0.00650              0.0421, 0.0490

Table 4: Mean and standard deviation of the BRIE and SUBEX estimators for estimating 
the improvement in the expected error rate when the training set size is increased from n = 50 
to m = 150 and m = 200. The BRIE estimator used an unrestricted multivariate normal 
model to estimate $p_X(x)$. The BRIE estimator is seen to be significantly less biased than 
the SUBEX estimator, and to have a smaller standard deviation across training sets. 

p    r      $\tau(75)$    $\hat\tau_B(75)$, SD    $\hat\tau_S(75)$, SD    $\tau(100)$   $\hat\tau_B(100)$, SD   $\hat\tau_S(100)$, SD
10   0.10   0.179         0.184, 0.0634           0.166, 0.0735           0.172         0.179, 0.0619           0.152, 0.0828
15   0.10   0.174         0.175, 0.0636           0.141, 0.0755           0.161         0.164, 0.0639           0.127, 0.0832
20   0.10   0.176         0.177, 0.0679           0.154, 0.0831           0.161         0.163, 0.0678           0.136, 0.0926
25   0.10   0.181         0.187, 0.0715           0.174, 0.0851           0.163         0.172, 0.0713           0.151, 0.0994
15   0.00   0.187         0.190, 0.0655           0.169, 0.0831           0.171         0.179, 0.0664           0.151, 0.0940
15   0.25   0.155         0.158, 0.0610           0.147, 0.0739           0.147         0.148, 0.0610           0.134, 0.0814
15   0.50   0.126         0.126, 0.0581           0.110, 0.0686           0.118         0.118, 0.0582           0.100, 0.0748
15   0.75   0.0891        0.0861, 0.0507          0.0744, 0.0505          0.0868        0.0814, 0.0508          0.0683, 0.0527

Table 5: Mean and standard deviation of the BRIE and SUBEX estimators for estimating 
the learning curve when the training set size is increased from n = 50 to m = 75 and 
m = 100. The BRIE estimator used a multivariate normal model with the restriction that 
$\Sigma_{ij} = \sigma^2\rho^{|i-j|}$ to estimate $p_X(x)$. The BRIE estimator is seen to be significantly less biased 
than the SUBEX estimator, and to have a smaller standard deviation across training 
sets. 



SUBEX is substantial. BRIE also has a smaller variance than SUBEX, but the advantage is 
substantially smaller than in the case of estimating learning curve increments. This is likely 
due to the fact that the bias correction used in the BRIE method is based on the highly 
variable leave-one-out cross validation estimator of T{n — 1). 

In practice, learning curve estimates are useful to the extent that they can distinguish 
between substantially different possible true learning curve patterns. For estimating learning 
curve increments, with an increase in sample size from 50 to 150 (table 4, left columns), the 
range of possible true increments is roughly 0.05 (0.015 to 0.0684). The standard error for 
the BRIE estimate of these quantities is around 0.01, so the range of possible outcomes 
in our simulations is at least 5 times greater than the standard error. For estimating the 
learning curves themselves, the analogous results in table 6 (left columns) show a range of 
0.08 in the true values, and a standard error of 0.05 to 0.07. Thus the maximum observed
difference is only slightly greater than the standard error, suggesting that the practical value 
of estimators of the learning curve may be limited, while useful information can be obtained 
from estimates of the learning curve increments. This point is underscored in the example 
considered in the next section. 
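
The comparison in the preceding paragraph is simple arithmetic, and can be checked with a short Python sketch. The true error rates and BRIE standard deviations below are copied from the left-hand columns of table 6, and the increment figures are the ones quoted in the text for table 4. The helper name `range_to_se` is ours and is only illustrative.

```python
# True error rates tau(150) and BRIE standard deviations, copied from the
# left-hand columns of table 6.
true_tau_150 = [0.164, 0.149, 0.141, 0.143, 0.160, 0.131, 0.106, 0.0800]
brie_sd_150 = [0.0647, 0.0643, 0.0676, 0.0710, 0.0670, 0.0611, 0.0583, 0.0508]


def range_to_se(values, standard_errors):
    """Ratio of the spread of the true values to the largest standard error."""
    spread = max(values) - min(values)
    return spread / max(standard_errors)


# Learning curve itself: spread of true values versus the BRIE standard error.
curve_ratio = range_to_se(true_tau_150, brie_sd_150)

# Learning curve increments: range 0.015 to 0.0684 with a standard error of
# about 0.01, as quoted in the text for table 4.
increment_ratio = (0.0684 - 0.015) / 0.01

print(f"curve: {curve_ratio:.2f}, increments: {increment_ratio:.2f}")
```

A ratio close to one means that sampling variation is comparable to the differences one would like to detect. This is the sense in which the increments are more informative than the curve itself.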

p     r       $\tau(150)$   $\hat\tau_B(150)$, SD   $\hat\tau_S(150)$, SD   $\tau(200)$   $\hat\tau_B(200)$, SD   $\hat\tau_S(200)$, SD
10    0.10    0.164         0.173, 0.0647           0.139, 0.0911           0.158         0.169, 0.0651           0.133, 0.0945
15    0.10    0.149         0.153, 0.0643           0.113, 0.0905           0.141         0.146, 0.0644           0.107, 0.0930
20    0.10    0.141         0.148, 0.0676           0.120, 0.100            0.133         0.139, 0.0674           0.112, 0.103
25    0.10    0.143         0.155, 0.0710           0.130, 0.111            0.129         0.144, 0.0707           0.122, 0.114
15    0.00    0.160         0.167, 0.0670           0.134, 0.101            0.151         0.161, 0.0674           0.127, 0.103
15    0.25    0.131         0.137, 0.0611           0.121, 0.116            0.124         0.131, 0.0613           0.116, 0.0917
15    0.50    0.106         0.101, 0.0583           0.919, 0.0795           0.101         0.105, 0.0583           0.0886, 0.0811
15    0.75    0.0800        0.0761, 0.0508          0.0625, 0.0550          0.0761        0.0730, 0.0508          0.0597, 0.0561



Table 6: Mean and standard deviation of the BRIE and SUBEX estimators of the learning
curve when the training set size is increased from n = 50 to m = 150 and m = 200. The
BRIE estimator used a multivariate normal model with the restriction that
$\Sigma_{ij} = \sigma^2 \rho^{|i-j|}$ to estimate $p_X(x)$. The BRIE estimator is seen to be
significantly less biased than the SUBEX estimator, and to have a smaller standard
deviation across training sets.



5 Example: Predicting the four year survival probability for CLL

We next demonstrate the BRIE approach to estimating learning curves using data from
a study of prognostic factors for outcomes of patients with chronic lymphocytic leukemia
(CLL). The duration between the time that a subject entered a study and the time the
subject required treatment (TTFT), a surrogate for disease progression, was obtained for
209 CLL subjects in a prospective study (Ouillette et al. 2011). For this analysis, the TTFT
outcomes were dichotomized according to whether treatment was needed within four years
of diagnosis. Eleven potential prognostic markers were used to predict this outcome using
logistic regression. The markers are: ZAP70%, p53 status, CD38%, IgVH mutation status,
age at diagnosis, Rai stage at diagnosis, number of positive lymph node groups,
subchromosomal losses, chromosomal losses, subchromosomal gains, and chromosomal gains.
This set of predictor variables includes binary (p53, IgVH), ordered categorical (Rai stage),
count (chromosomal/subchromosomal losses and gains, positive lymph node groups), and
continuous (ZAP70%, CD38%, age) measures.

We calculated the BRIE estimator of the learning curves using two different models for
the marginal distribution $p_X(x)$ of the predictor variables. The first approach, which we
denote "GM", used a Gaussian mixture. Here we stratified the training data into four groups,
according to the joint pattern of values for the two binary variables (p53 and IgVH). Within
each such group, the mean of the other nine non-binary variables was calculated, and a
pooled covariance matrix for these nine variables over the four groups (centered at their
respective means) was calculated. The GM model for the predictor variables was a
four-component Gaussian mixture, where the four components correspond to the four strata
determined by the joint values of p53 and IgVH. The mixture components had means equal
to the four stratum means in the training data, a common covariance structure equal to
the pooled covariance matrix from the data, and marginal frequencies equal to the empirical
frequencies of the four subgroups of training data defined by p53 and IgVH.
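
The GM construction described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used for the analysis. The function names (`fit_gm`, `sample_gm`, `cholesky`) and the toy data are hypothetical, only two non-binary variables are used instead of nine, and the pooled covariance is divided by n minus the number of strata, a divisor the text does not specify.

```python
import math
import random
from collections import defaultdict


def fit_gm(binary, other):
    """Fit the stratified Gaussian mixture: one component per joint binary pattern.

    binary: list of (b1, b2) tuples; other: list of equal-length numeric rows.
    Returns (weights, means, pooled_cov); weights and means are keyed by pattern.
    """
    groups = defaultdict(list)
    for pattern, row in zip(binary, other):
        groups[tuple(pattern)].append(row)
    n, d = len(other), len(other[0])
    weights = {g: len(rows) / n for g, rows in groups.items()}
    means = {g: [sum(col) / len(rows) for col in zip(*rows)]
             for g, rows in groups.items()}
    # Pooled covariance: residuals are centered at their own stratum mean.
    pooled = [[0.0] * d for _ in range(d)]
    for g, rows in groups.items():
        for row in rows:
            resid = [x - m for x, m in zip(row, means[g])]
            for i in range(d):
                for j in range(d):
                    pooled[i][j] += resid[i] * resid[j]
    dof = n - len(groups)  # assumption: divide by n minus the number of strata
    pooled = [[v / dof for v in r] for r in pooled]
    return weights, means, pooled


def cholesky(a):
    """Lower-triangular L with L L^T = a, for symmetric positive definite a."""
    d = len(a)
    low = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(s) if i == j else s / low[j][j]
    return low


def sample_gm(weights, means, pooled, size, rng):
    """Draw covariate vectors: pick a stratum, then add correlated Gaussian noise."""
    low = cholesky(pooled)
    patterns = list(weights)
    draws = []
    for _ in range(size):
        g = rng.choices(patterns, weights=[weights[p] for p in patterns])[0]
        z = [rng.gauss(0.0, 1.0) for _ in means[g]]
        x = [m + sum(low[i][k] * z[k] for k in range(i + 1))
             for i, m in enumerate(means[g])]
        draws.append((g, x))
    return draws


# Toy stand-in for the training data: two binary and two non-binary variables.
rng = random.Random(0)
binary = [(rng.randint(0, 1), rng.randint(0, 1)) for _ in range(200)]
other = [[rng.gauss(2.0 * b1, 1.0), rng.gauss(-1.0 * b2, 1.0)] for b1, b2 in binary]
weights, means, pooled = fit_gm(binary, other)
simulated = sample_gm(weights, means, pooled, size=5, rng=rng)
```

Each simulated record is a pair consisting of the binary pattern and the non-binary values. Count and ordered categorical variables are treated as Gaussian in this sketch, as in the GM description.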

The second approach to modeling the covariate distribution, which we denote "GC", 
used a Gaussian copula. Here all data were converted to normal scores, then the correlation 
matrix of the normal scores was calculated. To produce simulated data from this model, 
we simulated Gaussian vectors according to this correlation matrix, then transformed each
component of these vectors with the corresponding inverse normal scores function. The 
resulting model for simulated data has marginal distributions exactly equal to the univariate 
empirical distributions of the training set, and dependence which approximates the training 
set dependence. 
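
The GC construction can be sketched as follows, using only the Python standard library. The function names and the toy data are hypothetical. Normal scores are computed here as $\Phi^{-1}(\mathrm{rank}/(n+1))$, which is one common convention and an assumption on our part, and ties are broken by position, which matters for discrete variables.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def normal_scores(column):
    """Rank-based normal scores; ties are broken by position (a simplification)."""
    n = len(column)
    order = sorted(range(n), key=lambda i: column[i])
    scores = [0.0] * n
    for rank, i in enumerate(order, start=1):
        scores[i] = STD_NORMAL.inv_cdf(rank / (n + 1))
    return scores


def correlation_matrix(columns):
    """Pearson correlation matrix of a list of equal-length columns."""
    n = len(columns[0])
    centered = []
    for col in columns:
        mean = sum(col) / n
        centered.append([v - mean for v in col])
    norms = [math.sqrt(sum(v * v for v in c)) for c in centered]
    return [[sum(a * b for a, b in zip(ci, cj)) / (ni * nj)
             for cj, nj in zip(centered, norms)]
            for ci, ni in zip(centered, norms)]


def cholesky(a):
    """Lower-triangular L with L L^T = a, for symmetric positive definite a."""
    d = len(a)
    low = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(s) if i == j else s / low[j][j]
    return low


def sample_copula(columns, size, rng):
    """Simulate rows whose margins are the empirical margins of `columns`."""
    corr = correlation_matrix([normal_scores(c) for c in columns])
    low = cholesky(corr)
    sorted_cols = [sorted(c) for c in columns]
    n = len(columns[0])
    rows = []
    for _ in range(size):
        z = [rng.gauss(0.0, 1.0) for _ in columns]
        g = [sum(low[i][k] * z[k] for k in range(i + 1)) for i in range(len(z))]
        # Inverse normal-scores map: Gaussian value -> uniform -> empirical quantile.
        rows.append([sc[min(n - 1, int(STD_NORMAL.cdf(v) * n))]
                     for sc, v in zip(sorted_cols, g)])
    return rows


# Toy training data: one symmetric and one skewed variable, positively related.
rng = random.Random(1)
base = [rng.gauss(0.0, 1.0) for _ in range(100)]
columns = [base, [math.exp(0.5 * b + rng.gauss(0.0, 0.5)) for b in base]]
simulated = sample_copula(columns, size=20, rng=rng)
```

Because each simulated component is an empirical quantile of the corresponding training column, every simulated value is one of the observed training values. This is why the marginal distributions match the training set exactly.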

The results are given in table 7. Four different training set sizes are used (50, 75, 100,
150). For each training set size n, we sampled n observations without replacement from the
overall CLL data set of 209 observations. These n observations were used for three purposes:
to estimate the misclassification error rate using cross-validated logistic regression, to fit a
logistic regression model for $p_{Y|X}(y \mid x)$, and to fit models for the predictor variable
distribution $p_X(x)$ using the GM and GC approaches. Together, $p_X(x)$ and
$p_{Y|X}(y \mid x)$ were used to define the data-generating population
$p_{YX}(y, x) = p_{Y|X}(y \mid x)\, p_X(x)$. We then generated data sets of size n, 2n, and 3n
from $p_{YX}(y, x)$, fit another logistic regression model to each of these data sets, and
evaluated the accuracy of these fitted rules relative to the data-generating population
$p_{YX}(y, x)$. This process was repeated 1000 times and averaged to produce the
results in table 7.
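
The simulation loop just described can be illustrated with a deliberately simplified Python sketch. It is not the analysis code. A single standard normal predictor stands in for the fitted $p_X(x)$ (in place of the GM or GC model), the "training set" is itself simulated from a hypothetical logistic model, 200 replications are used instead of 1000, and accuracy relative to the data-generating population is approximated on a fixed Monte Carlo sample of predictor values. All function names are ours.

```python
import math
import random


def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def simulate(model, size, rng):
    """Draw (x, y) pairs from p_X = N(0, 1) and a logistic p_{Y|X}."""
    a, b = model
    xs = [rng.gauss(0.0, 1.0) for _ in range(size)]
    ys = [1 if rng.random() < sigmoid(a + b * x) else 0 for x in xs]
    return xs, ys


def fit_logistic(xs, ys, iterations=10):
    """Intercept and slope by Newton's method (tiny ridge keeps the solve stable)."""
    a = b = 0.0
    for _ in range(iterations):
        g0 = g1 = h01 = 0.0
        h00 = h11 = 1e-8
        for x, y in zip(xs, ys):
            p = sigmoid(a + b * x)
            w = p * (1.0 - p)
            g0 += y - p
            g1 += (y - p) * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b


def error_rate(rule, population, eval_xs):
    """Expected misclassification rate of a fitted rule under the population."""
    (a_hat, b_hat), (a, b) = rule, population
    total = 0.0
    for x in eval_xs:
        p = sigmoid(a + b * x)
        total += (1.0 - p) if a_hat + b_hat * x > 0 else p
    return total / len(eval_xs)


def learning_curve(population, sizes, reps, eval_xs, rng):
    """Average error of rules refitted to simulated data sets of each size."""
    curve = {}
    for m in sizes:
        errors = [error_rate(fit_logistic(*simulate(population, m, rng)),
                             population, eval_xs) for _ in range(reps)]
        curve[m] = sum(errors) / reps
    return curve


rng = random.Random(2)
n = 50
pilot_x, pilot_y = simulate((-0.5, 1.0), n, rng)  # stand-in for the observed training set
fitted = fit_logistic(pilot_x, pilot_y)           # plays the role of the fitted p_{Y|X}
eval_xs = [rng.gauss(0.0, 1.0) for _ in range(1000)]
curve = learning_curve(fitted, [n, 2 * n, 3 * n], reps=200, eval_xs=eval_xs, rng=rng)
gain_2n = curve[n] - curve[2 * n]
gain_3n = curve[n] - curve[3 * n]
```

The quantities `gain_2n` and `gain_3n` correspond to the accuracy gains reported in the last two columns of table 7. Every value in `curve` is bounded below by the error of the data-generating rule itself, since no fitted rule can outperform it on the fitted population.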

Table 7 shows that improvement in prediction accuracy of only 2-3 percentage points may
occur when increasing the training set sample size by factors of 2-3 in this setting. This
result is stable over the two approaches to modeling the predictor variables. As expected,
the magnitudes of the gain are greatest for smaller training set sizes, and when the
increment in training set size is larger (i.e. when comparing $\hat\tau(n) - \hat\tau(3n)$ to
$\hat\tau(n) - \hat\tau(2n)$, where $\hat\tau$ denotes the imputation-based estimate). Overall,
this analysis suggests that only a small improvement in accuracy is likely to result from
increasing the training set size in this setting. To achieve more substantial gains in
accuracy, more informative markers or a better modeling framework for these 11 markers
should be sought.

As expected, the imputation-based estimate $\hat\tau$ underestimates the error rate,
especially when the true error rate is high. This is a consequence of overfitting, since the
strength of association in the fitted logistic regression model
$\operatorname{logit} P(Y = 1 \mid X = x) = \beta' x$ will tend to be stronger than the strength
of association in the true model. In particular, if Y and X are independent in the true model,
the fitted model will still have some association. As expected, this tendency diminishes
as the sample size grows. As a result, $\hat\tau(m)$ tends to increase with m, whereas the CV
error, and presumably the true error, decrease with n. We note that for larger sample sizes,
the copula model for $p_X(x)$ produces BRIE curve estimates that more closely resemble the
cross-validation results.

As noted above, the estimate $\hat\tau$ of the learning curve itself is generally too variable
to be useful, so we focus on the accuracy gain, as shown in the final two columns of table 7.
Since we have 209 data points to work with, we directly apply cross-validation on
subsamples up to size 209 to provide a direct cross-validation based estimate that is known
to be nearly unbiased. For example, under the GC model, the error rate is predicted to drop
from 0.2181 to 0.1828 when the training set sample size grows from 50 to 150, a gain of
0.0353. According to the cross-validation estimates, the gain is 0.2775 - 0.2363 = 0.0412.
Similarly, when the training set sample size grows from 75 to 150, the predicted gain in
classification accuracy using the GC model for $p_X(x)$ is 0.0191, and the corresponding
estimate from cross-validation is 0.0273.

Model   n     CV       $\hat\tau(n)$   $\hat\tau(2n)$   $\hat\tau(3n)$   Gain, n to 2n   Gain, n to 3n
GM      50    0.2775   0.2210          0.1968           0.1881           0.0242          0.0329
GM      75    0.2636   0.2342          0.2157           0.2096           0.0185          0.0246
GM      100   0.2580   0.2340          0.2201           0.2150           0.0139          0.0190
GM      150   0.2363   0.2364          0.2264           0.2227           0.0100          0.0137
GC      50    0.2775   0.2181          0.1922           0.1828           0.0259          0.0353
GC      75    0.2636   0.2387          0.2196           0.2130           0.0191          0.0257
GC      100   0.2580   0.2467          0.2316           0.2263           0.0151          0.0204
GC      150   0.2363   0.2470          0.2358           0.2321           0.0112          0.0149

Table 7: Learning curve analysis for the CLL data. Columns 7 and 8 show the gain in
accuracy when the training set sample size increases from n to 2n (i.e.
$\hat\tau(n) - \hat\tau(2n)$), and from n to 3n (i.e. $\hat\tau(n) - \hat\tau(3n)$), respectively.
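
The gains quoted for table 7 can be reproduced directly from the tabulated error rates. The short Python check below copies the table values and recomputes both the imputation-based gains and the cross-validation gains. The dictionary layout and function names are ours and purely illustrative.

```python
# (model, n): (CV error, estimate at n, at 2n, at 3n, printed gain n->2n, printed gain n->3n)
table7 = {
    ("GM", 50): (0.2775, 0.2210, 0.1968, 0.1881, 0.0242, 0.0329),
    ("GM", 75): (0.2636, 0.2342, 0.2157, 0.2096, 0.0185, 0.0246),
    ("GM", 100): (0.2580, 0.2340, 0.2201, 0.2150, 0.0139, 0.0190),
    ("GM", 150): (0.2363, 0.2364, 0.2264, 0.2227, 0.0100, 0.0137),
    ("GC", 50): (0.2775, 0.2181, 0.1922, 0.1828, 0.0259, 0.0353),
    ("GC", 75): (0.2636, 0.2387, 0.2196, 0.2130, 0.0191, 0.0257),
    ("GC", 100): (0.2580, 0.2467, 0.2316, 0.2263, 0.0151, 0.0204),
    ("GC", 150): (0.2363, 0.2470, 0.2358, 0.2321, 0.0112, 0.0149),
}


def predicted_gain(model, n, factor):
    """Imputation-based gain when the training set grows from n to factor * n."""
    _, at_n, at_2n, at_3n, _, _ = table7[(model, n)]
    return at_n - (at_2n if factor == 2 else at_3n)


def cv_gain(n_from, n_to):
    """Cross-validation gain; the CV column does not depend on the model."""
    return table7[("GC", n_from)][0] - table7[("GC", n_to)][0]


# The two comparisons discussed in the text (GC model).
gc_50_to_150 = (predicted_gain("GC", 50, 3), cv_gain(50, 150))
gc_75_to_150 = (predicted_gain("GC", 75, 2), cv_gain(75, 150))

# Largest discrepancy between recomputed and printed gains over the whole table.
max_discrepancy = max(
    max(abs(predicted_gain(m, n, 2) - row[4]), abs(predicted_gain(m, n, 3) - row[5]))
    for (m, n), row in table7.items()
)

print([round(v, 4) for v in gc_50_to_150], [round(v, 4) for v in gc_75_to_150])
```

In both comparisons the imputation-based gain is somewhat smaller than the cross-validation gain, but the two are of the same order of magnitude.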



6 Discussion 

We have discussed three relatively simple approaches for estimating the learning curve of 
a classifier. SUBEX methods rely on a parametric model of the learning curve, and use 
unbiased estimates of the classification error rate for sample sizes smaller than the observed 
training set size to estimate the model parameters. IMPINT methods model the data dis- 
tribution, from which the learning curve can be estimated at arbitrary sample sizes without 
the need to model the learning curve.

Learning curve estimation is a challenging problem, and neither method considered here
gives highly accurate results. However, we see that even in the limited range of settings
considered here, gains in predictive performance ranging from 0.02 to 0.07 can be observed
for three-fold increases in the training set size (table 4, column 2). For a problem where
predictive accuracies in the 0.8-0.95 range are typical, knowing that a gain of 0.07 can be
achieved may lead to a very different strategy for follow-up research compared to knowing
that only a gain of 0.02 should be expected. The BRIE approach estimates the gain in
predictive performance with almost no bias, and with a standard error of at most 0.01. This
provides us with power to confidently assess whether we are at the low end or the high end
of the range of possible gains in performance.

The SUBEX and IMPINT approaches differ in several major ways, any of which could
impact their performances. One potential drawback of the SUBEX approach is that the
inverse power law model for $\tau$ may not be able to represent the true learning curve. An
exact analytic expression for $\tau$ is unlikely to exist, necessitating the use of convenience
parameterizations such as the inverse power law. Another concern for SUBEX is its use of
cross-validation, which is known to have high variance (Efron 1983). This variance may
propagate to the learning curve estimate. The IMPINT method is not subject to these
limitations, but concerns may arise about the need to estimate the data generating model,
which is not necessary for the SUBEX approach. It is unclear which features of the data
generating model are critical for learning curve estimation. At a minimum, the dimension
and some measure of the strength of the predictive relationship are clearly relevant.
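
To make the SUBEX idea concrete, the following Python sketch fits an inverse power law to error rates at sample sizes up to n and extrapolates beyond n. The parameterization $\tau(n) = a + b\,n^{-c}$ is a common form that we assume here for illustration. It may differ from the exact form used in this article, and the fitting method (a grid over c with closed-form least squares for a and b) is our choice. The input error rates are hypothetical and noise-free. In practice they would be noisy subsample estimates.

```python
def fit_inverse_power_law(sizes, errors):
    """Least-squares fit of tau(n) = a + b * n**(-c) over a grid of exponents c."""
    best = None
    e_bar = sum(errors) / len(errors)
    for k in range(1, 201):
        c = k / 100.0
        z = [n ** (-c) for n in sizes]
        z_bar = sum(z) / len(z)
        # For a fixed exponent, a and b follow from simple linear regression.
        sxx = sum((v - z_bar) ** 2 for v in z)
        sxy = sum((v - z_bar) * (e - e_bar) for v, e in zip(z, errors))
        b = sxy / sxx
        a = e_bar - b * z_bar
        sse = sum((a + b * v - e) ** 2 for v, e in zip(z, errors))
        if best is None or sse < best[0]:
            best = (sse, a, b, c)
    return best[1:]


def predict(params, m):
    """Extrapolated error rate at training set size m."""
    a, b, c = params
    return a + b * m ** (-c)


# Hypothetical, noise-free error rates at subsample sizes up to n = 50.
sizes = [10, 20, 30, 40, 50]
errors = [0.10 + 1.0 * n ** (-0.5) for n in sizes]
params = fit_inverse_power_law(sizes, errors)
extrapolated = predict(params, 150)
```

With noise-free inputs the fit recovers the generating curve. With noisy cross-validated inputs, small errors in the fitted exponent translate into larger errors at sample sizes far beyond the observed range.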

As noted above, the SUBEX approach models the learning curve, while the IMPINT 
approach models the full data distribution. The learning curve is a simpler object than the 
data distribution, hence SUBEX seems to require fewer assumptions. However, the learning 
curve is not directly observed. Any appropriate statistical modeling framework can be used to 
attain an estimate of the data generating model, and diagnostic and other tools are available
to assess the fit of the model. Analogous tools for assessing the appropriateness of the
learning curve model used by the SUBEX procedure are not readily available. Furthermore, 
the errors in the SUBEX procedure will be amplified by the need to extrapolate beyond the 
range of sample sizes that are directly estimated using subsampling. No analogous source of 
variation seems to be present in the IMPINT approach. 

Learning curves have the potential to become a useful tool in applied statistics. One 
relevant analogy is to the widely-practiced fields of power analysis and sample size planning. 
In this setting, preliminary estimates of effect sizes are used to assess the power for various 
study designs. Learning curves can be viewed as a power analysis tool to be used when the 
research aims involve prediction, rather than focusing on estimation and hypothesis testing. 
As in classical power analysis, over-reliance on point estimates from small pilot studies may 
not be advised. In practice it would be advisable to consider a range of possibilities for key 
population parameters and attempt to delineate those situations where substantial gains in 
predictive performance are expected to occur. 